These materials have been taken from the Software Carpentry: R Novice Lesson. You can find the original materials here


This lesson will cover some basic functions that can be used to manipulate data in R. Again, we will be using the gapminder data set, which includes country information on GDP, population, etc.


There are five main functions we’ll be talking about today, each allowing us to manipulate data frames. These five functions are:


If you haven’t already, make sure you have dplyr() and gapminder() installed and loaded with the following commands:

# Download the packages
# install.packages(c("tidyverse", "gapminder"))

# Load the packages for use
library(tidyverse)
gapminder <- read.csv("../data/gapminder_data.csv")


Let’s take a quick look at our data frame to remind ourselves of its structure. We do this using the head() command, which will display the first 10 rows (given by n = 10) of our data frame.

head(gapminder, n = 10)
       country continent year lifeExp      pop gdpPercap
1  Afghanistan      Asia 1952  28.801  8425333  779.4453
2  Afghanistan      Asia 1957  30.332  9240934  820.8530
3  Afghanistan      Asia 1962  31.997 10267083  853.1007
4  Afghanistan      Asia 1967  34.020 11537966  836.1971
5  Afghanistan      Asia 1972  36.088 13079460  739.9811
6  Afghanistan      Asia 1977  38.438 14880372  786.1134
7  Afghanistan      Asia 1982  39.854 12881816  978.0114
8  Afghanistan      Asia 1987  40.822 13867957  852.3959
9  Afghanistan      Asia 1992  41.674 16317921  649.3414
10 Afghanistan      Asia 1997  41.763 22227415  635.3414


Quick aside on tibbles and data frames

In R, one of the main types of objects/variables we’re going to be working with is a data frame. This is much like a table you would view in Excel, where column represent variables or measures and rows represent measurements, samples, or observations.


Choose Columns: select

The first function we’ll be using is select(). This function let’s us pick columns from our data frame, based on name (e.g. year) or by index (e.g. 3).

Let’s try using select() to pick out a few columns: “country”, “year”, “lifeExp”, and “pop”. We’ll be assigning these columns to a new data frame, gapminder_select. Then we’ll use head() to see if it worked.

# select() code here
gapminder_select <- select(gapminder, country, year, lifeExp, pop)

# Check the data frame
head(gapminder_select, n = 10)
       country year lifeExp      pop
1  Afghanistan 1952  28.801  8425333
2  Afghanistan 1957  30.332  9240934
3  Afghanistan 1962  31.997 10267083
4  Afghanistan 1967  34.020 11537966
5  Afghanistan 1972  36.088 13079460
6  Afghanistan 1977  38.438 14880372
7  Afghanistan 1982  39.854 12881816
8  Afghanistan 1987  40.822 13867957
9  Afghanistan 1992  41.674 16317921
10 Afghanistan 1997  41.763 22227415


As you can see, our new data frame contains only a subset of the columns from the original data frame, based on the names we provided in the select() command.


Here we’ll also introduce another great feature of dplyr(): the pipe ( %>% ). This symbol sends or pipes an object (e.g. a data frame like gapminder) INTO a function (e.g. select()).

So, the above select() command can be rewritten as follows (NOTE: the “.” is a placeholder, which represents the object being piped). Again, we can check our result using head().

# select() using pipe syntax
gapminder_pipe <- gapminder %>% select(., country, year, lifeExp, pop)

head(gapminder_pipe, n = 10)
       country year lifeExp      pop
1  Afghanistan 1952  28.801  8425333
2  Afghanistan 1957  30.332  9240934
3  Afghanistan 1962  31.997 10267083
4  Afghanistan 1967  34.020 11537966
5  Afghanistan 1972  36.088 13079460
6  Afghanistan 1977  38.438 14880372
7  Afghanistan 1982  39.854 12881816
8  Afghanistan 1987  40.822 13867957
9  Afghanistan 1992  41.674 16317921
10 Afghanistan 1997  41.763 22227415


We can actually simplify the above command further - dplyr’s functions such as select() are smart enough that you don’t actually need to include the “.” placeholder, as shown below.

# select() using pipe syntax w/out a placeholder
gapminder_pipe2 <- gapminder %>% select(country, year, lifeExp, pop)

head(gapminder_pipe2, n = 10)
       country year lifeExp      pop
1  Afghanistan 1952  28.801  8425333
2  Afghanistan 1957  30.332  9240934
3  Afghanistan 1962  31.997 10267083
4  Afghanistan 1967  34.020 11537966
5  Afghanistan 1972  36.088 13079460
6  Afghanistan 1977  38.438 14880372
7  Afghanistan 1982  39.854 12881816
8  Afghanistan 1987  40.822 13867957
9  Afghanistan 1992  41.674 16317921
10 Afghanistan 1997  41.763 22227415


Challenge 1

Using the select() command and pipe (%>%) notation, pick the following columns from the gapminder data frame, assign them to a new variable (we’ll use x), and display the results using head(x, n = 10). Columns to choose are:

  • continent
  • GDP per capita
  • life expectancy
  • year
# Answer here:
x <- select()


Choose Rows: filter

So we’ve covered selecting columns, but what about rows? This is where filter() comes in. This function allows us to choose rows from our data frame using some logical criteria. An example is filtering for rows in which the country is Canada. This can also be applied to numerical values, such as the year being equal to 1967, or life expectancy greater than 30.

NOTE: In R, equality (e.g. country is Canada, year is 1967) is done using a double equals sign (==).

Let’s go through a couple examples.

# Filter rows where country is Canada
gapminder_canada <- gapminder %>% filter(country == "Canada")

head(gapminder_canada, n = 10)
   country continent year lifeExp      pop gdpPercap
1   Canada  Americas 1952   68.75 14785584  11367.16
2   Canada  Americas 1957   69.96 17010154  12489.95
3   Canada  Americas 1962   71.30 18985849  13462.49
4   Canada  Americas 1967   72.13 20819767  16076.59
5   Canada  Americas 1972   72.88 22284500  18970.57
6   Canada  Americas 1977   74.21 23796400  22090.88
7   Canada  Americas 1982   75.76 25201900  22898.79
8   Canada  Americas 1987   76.86 26549700  26626.52
9   Canada  Americas 1992   77.95 28523502  26342.88
10  Canada  Americas 1997   78.61 30305843  28954.93


Let’s try another one, this time filtering on life expectancy above a certain threshold:

# Filter for rows where life expectancy is greater than 50
gapminder_LE <- gapminder %>% filter(lifeExp > 50)

head(gapminder_LE, n = 10)
   country continent year lifeExp     pop gdpPercap
1  Albania    Europe 1952  55.230 1282697  1601.056
2  Albania    Europe 1957  59.280 1476505  1942.284
3  Albania    Europe 1962  64.820 1728137  2312.889
4  Albania    Europe 1967  66.220 1984060  2760.197
5  Albania    Europe 1972  67.690 2263554  3313.422
6  Albania    Europe 1977  68.930 2509048  3533.004
7  Albania    Europe 1982  70.420 2780097  3630.881
8  Albania    Europe 1987  72.000 3075321  3738.933
9  Albania    Europe 1992  71.581 3326498  2497.438
10 Albania    Europe 1997  72.950 3428038  3193.055


We can also filter with multiple arguments, each separated by a comma:

# filter() for Canada and life expectancy greater than 80
gapminder_C_LE <- gapminder %>% filter(country == "Canada", lifeExp > 80)

head(gapminder_C_LE, n = 10)
  country continent year lifeExp      pop gdpPercap
1  Canada  Americas 2007  80.653 33390141  36319.24


Challenge 2

Use filter() to choose data for African countries, from the year 1980 and onwards.

# Challenge 2 code here
x <- filter()


Create New Columns: mutate()

Let’s say we now want to calculate the GDP in billions, which is done by mutiplying the GDP per capita by the population, then dividing by 1 billion (1 * 10^9). mutate() will perform this calculation on each row in the data frame, one row at a time (i.e. row-wise). The code below will calculate the GDP in billions:

# Use mutate() to calculate GDP in billions
gapminder_gdpBil <- gapminder %>% mutate(gdp_billion = gdpPercap * pop / 10^9)

head(gapminder_gdpBil, n = 10)
       country continent year lifeExp      pop gdpPercap gdp_billion
1  Afghanistan      Asia 1952  28.801  8425333  779.4453    6.567086
2  Afghanistan      Asia 1957  30.332  9240934  820.8530    7.585449
3  Afghanistan      Asia 1962  31.997 10267083  853.1007    8.758856
4  Afghanistan      Asia 1967  34.020 11537966  836.1971    9.648014
5  Afghanistan      Asia 1972  36.088 13079460  739.9811    9.678553
6  Afghanistan      Asia 1977  38.438 14880372  786.1134   11.697659
7  Afghanistan      Asia 1982  39.854 12881816  978.0114   12.598563
8  Afghanistan      Asia 1987  40.822 13867957  852.3959   11.820990
9  Afghanistan      Asia 1992  41.674 16317921  649.3414   10.595902
10 Afghanistan      Asia 1997  41.763 22227415  635.3414   14.121996


Combine Functions with Pipes

We’ve seen that pipes ( %>% ) can be used to send an object such as a data frame into a function, such as select(), or filter(). But they can also be used to send the output of one function into another function. This allows us to chain together multiple commmands, without the need for intermediate variables.

Let’s take a look at this in an example.

# select() the five columns, and filter() for Canada
gapminder_multi <- gapminder %>% 
    select(country, year, lifeExp, pop, gdpPercap) %>% 
    filter(country == "Canada")

head(gapminder_multi, n = 10)
   country year lifeExp      pop gdpPercap
1   Canada 1952   68.75 14785584  11367.16
2   Canada 1957   69.96 17010154  12489.95
3   Canada 1962   71.30 18985849  13462.49
4   Canada 1967   72.13 20819767  16076.59
5   Canada 1972   72.88 22284500  18970.57
6   Canada 1977   74.21 23796400  22090.88
7   Canada 1982   75.76 25201900  22898.79
8   Canada 1987   76.86 26549700  26626.52
9   Canada 1992   77.95 28523502  26342.88
10  Canada 1997   78.61 30305843  28954.93


We can further expand on this by incorporating our mutate() command from earlier, linking multiple functions into a single command. Be sure to indent (TAB key) when moving to a new line after a pipe.

# select() the four columns, filter() for Canada, and calculate GDP in billions
gapminder_multi_2 <- gapminder %>% 
    select(country, year, lifeExp, pop, gdpPercap) %>% 
    filter(country == "Canada") %>% 
    mutate(gdp_billion = gdpPercap * pop / 10^9)

head(gapminder_multi_2, n = 10)
   country year lifeExp      pop gdpPercap gdp_billion
1   Canada 1952   68.75 14785584  11367.16    168.0701
2   Canada 1957   69.96 17010154  12489.95    212.4560
3   Canada 1962   71.30 18985849  13462.49    255.5967
4   Canada 1967   72.13 20819767  16076.59    334.7108
5   Canada 1972   72.88 22284500  18970.57    422.7497
6   Canada 1977   74.21 23796400  22090.88    525.6835
7   Canada 1982   75.76 25201900  22898.79    577.0931
8   Canada 1987   76.86 26549700  26626.52    706.9260
9   Canada 1992   77.95 28523502  26342.88    751.3913
10  Canada 1997   78.61 30305843  28954.93    877.5034


Calculations with group_by and summarise

These functions allow us to work on our data in specific groups. For example, we can use group_by() to group observations by country, then calculate the average life expectancy for each country.

# group_by() country, calculate average life expectancy
gapminder_grp <- gapminder %>% 
group_by(country) %>% 
summarise(mean(lifeExp))

head(gapminder_grp, n = 10)
# A tibble: 10 x 2
   country     `mean(lifeExp)`
   <fct>                 <dbl>
 1 Afghanistan            37.5
 2 Albania                68.4
 3 Algeria                59.0
 4 Angola                 37.9
 5 Argentina              69.1
 6 Australia              74.7
 7 Austria                73.1
 8 Bahrain                65.6
 9 Bangladesh             49.8
10 Belgium                73.6


Let’s do another example, again grouping by country. This time, we’ll calculate the mean and standard deviation of the GDP per capita. We’ll also specify the column names inside of the summarise() command.

gapminder_mean_sd <- gapminder %>% 
    group_by(country) %>% 
    summarise(mean_gdp = mean(gdpPercap), sd_gdp = sd(gdpPercap))

head(gapminder_mean_sd, n = 10)
# A tibble: 10 x 3
   country     mean_gdp sd_gdp
   <fct>          <dbl>  <dbl>
 1 Afghanistan     803.   108.
 2 Albania        3255.  1192.
 3 Algeria        4426.  1310.
 4 Angola         3607.  1166.
 5 Argentina      8956.  1863.
 6 Australia     19981.  7815.
 7 Austria       20412.  9655.
 8 Bahrain       18078.  5415.
 9 Bangladesh      818.   235.
10 Belgium       19901.  8391.


Combining tables with left_join()

Let’s say you have a data frame you’ve generated, containing a list of gene IDs and some corresponding value, such as expression. And you have another table that contains many (e.g. all) human gene IDs, as well as the names for those genes. For example the gene ID “ENSG00000012048” corresponds to the gene BRCA1. Now let’s say you want to map between the two tables using the gene ID, to create a single table with ID-Expression-Name for each gene. That’s what left_join() is for!

This function takes one data frame “x” and using a specified column, looks for matching entries in “y”. Note that the output data frame will contain all rows and columns from “x”, as well as all columns from “y”, but only matching rows from “y”.

First let’s load some simple example data to play with:

fruits1 <- read.csv("../data/fruits_table1.csv", stringsAsFactors = FALSE)
fruits2 <- read.csv("../data/fruits_table2.csv", stringsAsFactors = FALSE)

head(fruits1)
  FruitID Fruit_Exp
1      f1         1
2      f2        -1
3      f3         0
4      f4         4
5      f5        -8
head(fruits2)
  FruitID FruitName
1      f1     apple
2      f2    orange
3      f3    banana
4      f4      pear
5      f5 blueberry
6      f6     melon

Now we can use left_join() to combine the two tables, based on matching values in a specified column. The syntax is as follow:

left_join(fruits1, fruits2, by = "FruitID")
  FruitID Fruit_Exp FruitName
1      f1         1     apple
2      f2        -1    orange
3      f3         0    banana
4      f4         4      pear
5      f5        -8 blueberry


Extra: Note that you can have different column names in each of your data frames, and still join the tables together. The syntax for this is:

left_join(x, y, by = c("columnX" = "columnY"))


Tying it all together

Now let’s use all the commands we’ve covered and combine them with pipes into a single statement.

Let’s say we want calculate the mean and SD of the GDP (in billions) for each country, but only considering data from 1980 and onwards. We can accomplish this all in one step as follows.

# select() columns, filter() by year, calculate GDP in billions, mean() and sd()
# of GDP in billions
gapminder_final <- gapminder %>% 
    select(country, year, pop, gdpPercap) %>% 
    filter(year >= 1980) %>% 
    mutate(gdp_billion = gdpPercap * pop / 10^9) %>% 
    group_by(country) %>% 
    summarise(mean_gdpBillion = mean(gdp_billion), sd_gdpBillion = sd(gdp_billion))

head(gapminder_final, n = 10)
# A tibble: 10 x 3
   country     mean_gdpBillion sd_gdpBillion
   <fct>                 <dbl>         <dbl>
 1 Afghanistan            16.4          7.66
 2 Albania                13.1          4.84
 3 Algeria               149.          33.2 
 4 Angola                 28.9         15.5 
 5 Argentina             353.          91.5 
 6 Australia             478.         154.  
 7 Austria               225.          50.3 
 8 Bahrain                12.4          5.15
 9 Bangladesh            120.          54.3 
10 Belgium               272.          54.6